University of Michigan Biostatistics 699, Winter 2022
\(X\) is set of predictors
\(Y\) is an outcome (could be continuous or discrete, including binary)
\(\hat Y (X)\) is a prediction of \(Y\) based on \(X\)
smaller is better:
Squared error: \((Y-\hat Y (X))^2\)
Absolute error: \(|Y-\hat Y (X)|\)
Misclassification: \(1 - 1_{[Y=\hat Y(X)]}\)
Asymmetric versions of the above
Negative log-likelihood (if there is a statistical model): \(-2\log f(Y|\hat \theta)\)
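A minimal sketch of computing the loss functions above, using made-up toy vectors `y` and `y_hat` (not course data):

```r
# Toy continuous outcome and predictions
y     <- c(1.2, -0.5, 3.0)
y_hat <- c(1.0,  0.0, 2.5)

squared_error  <- (y - y_hat)^2   # smaller is better
absolute_error <- abs(y - y_hat)

# Misclassification loss for a binary outcome
y_bin     <- c(1, 0, 1)
y_bin_hat <- c(1, 1, 1)
misclass  <- 1 - as.numeric(y_bin == y_bin_hat)

mean(squared_error)  # average squared-error loss
mean(misclass)       # misclassification rate
```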
Calculate the loss for each of your fitted models
The model with the smallest loss is the ‘winner’
When using the same data for both model fitting and model comparison, the most overfit model will be selected
Replacing the negative log-likelihood with the AIC criterion counteracts this preference for overfit models
Closed-form, one step calculation
Applicable to any model in which number of estimated parameters can be properly counted
Estimates “in-sample” likelihood
Useful for non-nested comparison
Corrected version makes it useful even when \(n\) is small relative to number of parameters
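A minimal sketch of the closed-form, one-step AIC calculation for a non-nested comparison, using the built-in `mtcars` data (an illustrative choice, not course data):

```r
# Two non-nested candidate models for the same outcome
fit1 <- lm(mpg ~ wt,      data = mtcars)  # linear in weight
fit2 <- lm(mpg ~ log(wt), data = mtcars)  # log-transformed alternative

# AIC is computed in one step for each model; smaller is better
aic_table <- AIC(fit1, fit2)
aic_table
```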
Likelihood does not exist or is not of interest
Cannot easily count model size
Cross-validation is more general alternative
training data used for model fitting
Use these data to fit all models under consideration
validation data used for model comparison and selection
“estimating the performance of different models in order to choose the best one” (Hastie, Tibshirani, and Friedman 2009, p222)
testing data used for fitting selected model and conducting inference / prediction
“having chosen a final model, estimating its prediction error…on new data” (Hastie, Tibshirani, and Friedman 2009, p222)
See also Berk, Brown, and Zhao (2010)
training: design and build the car; everyone gets to practice on the same course
validation: all cars race on new course, and choose the winner
(inappropriate/wrong) testing: use the winning time from validation to predict how winner will continue to perform
(correct) testing: race the winning car by itself on yet another new course
Both AIC and cross-validation allow you to use the same data for both the training and validation steps
AIC does not extend to loss functions beyond likelihood
Cross-validation allows you to choose loss functions other than negative-log-likelihood
Neither solves the problem of post-selection inference, i.e. still need testing data
Have \(n\) observations in data
For \(j=1,\ldots,J\), where \(J\) is the number of random repetitions:
2.a Create a random partition into \(K\) equally sized parts, each containing \(n/K\) observations. Each part is a “fold”
2.b For \(\ell=1,\ldots,K\)
Treat \(\ell\)th fold as validation data, remaining \(K-1\) folds collapsed together as training data. Fit candidate models
Numerically evaluate model, i.e. calculate log-likelihood, prediction error, misclassification rate, or other loss function, on \(\ell\)th fold
Average loss across all values of \(\ell\) and \(j\), choose model with best value
Refit selected model to full data
cf. Section 7.10, Hastie, Tibshirani, and Friedman (2009)
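A minimal sketch of the K-fold loop above for a single repetition, using squared-error loss and two candidate linear models on `mtcars` (an illustrative choice; the fold assignment is random):

```r
set.seed(1)
K    <- 5
n    <- nrow(mtcars)
fold <- sample(rep(1:K, length.out = n))  # random partition into K folds

cv_loss <- function(formula) {
  losses <- numeric(K)
  for (l in 1:K) {
    train <- mtcars[fold != l, ]  # K-1 folds collapsed together
    valid <- mtcars[fold == l, ]  # held-out l-th fold
    fit   <- lm(formula, data = train)
    # Evaluate the loss on the held-out fold only
    losses[l] <- mean((valid$mpg - predict(fit, valid))^2)
  }
  mean(losses)  # average loss across folds
}

cv_loss(mpg ~ wt)       # candidate model 1
cv_loss(mpg ~ wt + hp)  # candidate model 2; smaller is better
```

The winner would then be refit to the full data, as in the final step above.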
Have \(n\) observations in data
For \(j=1,\ldots,J\), where \(J\) is the number of random repetitions:
2.a Fit candidate models to randomly selected X% of data (training data)
2.b Numerically evaluate models, i.e. calculate log-likelihood, prediction error, misclassification rate, or other loss function, on remaining (100-X)% of data (validation data)
Average across all values of \(j\), choose model with best value
Refit selected model to full data
Resulting models tend to be overfit and unstable (to perturbations in data, changes to choice of significance threshold)
Backward selection not possible in \(p > n\) scenarios
Selecting variables may not be sufficient. Additionally / alternatively, may want to control variability of parameter estimates that make it into model
Ridge regression (Hoerl and Kennard 1970), LASSO (Tibshirani 1996), and Elastic Net (Zou and Hastie 2005)
Ridge / Lasso / ENet as alternatives to maximum likelihood estimates (MLEs)
Lasso as an alternative to stepwise procedures for model selection
Maximizing the likelihood \(f(Y|\beta)\) gives the maximum likelihood estimate \[ \begin{aligned} \hat\beta_\mathrm{MLE} = \mathrm{argmax}_{\beta} f(Y|\beta) \end{aligned} \]
Limit the total size of the final estimates. Maximize \(f(Y|\beta)\) subject to \[ \begin{aligned} \mathrm{Ridge}: \mathrm{argmax}_{\beta} f(Y|\beta),\quad\mathrm{s.t.}\quad ||{\beta}||_2 \leq t\\ \mathrm{Lasso}: \mathrm{argmax}_{\beta} f(Y|\beta),\quad\mathrm{s.t.}\quad ||{\beta}||_1 \leq t \end{aligned} \] where \(||{\beta}||_2 = \sqrt{\sum_{j=1}^p \beta_j^2}\) and \(||{\beta}||_1 = \sum_{j=1}^p |\beta_j|\) and \(t\) is a “tuning parameter” (auxiliary parameter that cannot be directly estimated from the data)
What do these constraints look like?
Smaller polygons correspond to smaller \(t\)s
Idea of constraint intuitively very satisfying but would be helpful if this could be characterized as an unconstrained problem
Using Lagrangian ideas, it can be shown \[ \begin{aligned} &\mathrm{argmax}_{\beta} f(Y|\beta),\quad\mathrm{s.t.}\quad ||\beta||_2 \leq t\\ & = \mathrm{argmax}_{\beta} f(Y|\beta),\quad\mathrm{s.t.}\quad ||\beta||_2^2 \leq t^2\\ & = \mathrm{argmax}_{\beta} \left\{f(Y|\beta) - \lambda\left(||\beta||_2^2 - t^2\right)\right\}\\ & = \mathrm{argmax}_{\beta} \left\{f(Y|\beta) - \lambda||\beta||_2^2 \right\} \end{aligned} \]
Now \(\lambda\) is the tuning parameter, sharing a one-to-one correspondence with \(t\) (large \(t\) \(\Rightarrow\) small \(\lambda\))
For fixed \(\lambda\), maximize \(f(Y|\beta) - \lambda||\beta||_2^2\)
Good and Gaskins (1971)
\[ \begin{aligned} \mathrm{MSE}(\hat\beta)&=E\left[\left(\hat\beta-\beta\right)^\top\left(\hat\beta-\beta\right)\right]\\ &=E\left(\hat\beta^\top\hat\beta\right) - 2E\left(\hat\beta\right)^\top\beta + \beta^\top\beta\\ &=E\left(\hat\beta^\top\hat\beta\right) - E\left(\hat\beta\right)^\top E\left(\hat\beta\right)\\ &\quad + E\left(\hat\beta\right)^\top E\left(\hat\beta\right) -2E\left(\hat\beta\right)^\top\beta + \beta^\top\beta\\ &= \sum_{j=1}^p \mathrm{Var}(\hat\beta_j) + \left[E\left(\hat\beta\right)-\beta\right]^\top\left[E\left(\hat\beta\right)-\beta\right]\\ &= \sum_{j=1}^p \mathrm{Var}(\hat\beta_j) + \sum_{j=1}^p \mathrm{Bias}(\hat\beta_j)^2\\ &= \mathrm{Trace} \mathrm{Var}(\hat\beta) + \mathrm{Bias}(\hat\beta)^\top\mathrm{Bias}(\hat\beta) \end{aligned} \]
Ignoring the finite-sample bias, \[ \begin{aligned} \mathrm{MSE}\left(\hat\beta_\mathrm{MLE}\right) \approx \sigma^2 \mathrm{Trace}\left[\left( X^\top X\right)^{-1}\right] = \sigma^2 \sum_{j=1}^ p \dfrac{1}{e_j} \end{aligned} \]
where \(\sigma^2\) is the error variance and \(e_j\) is the \(j\)th eigenvalue of \(X^\top X\)
If focus is on (asymptotically) unbiased estimators, then \(\hat\beta_\mathrm{MLE}\) is usually preferred
But its variance, and therefore MSE, may be large if eigenvalues are small (one symptom of collinearity of predictors and/or high-dimension of predictor vector)
Can we decrease MSE by allowing for some bias?
Ridge estimator in a linear regression is available in closed-form (unlike Lasso). For a normal likelihood,
\[ \begin{aligned} \hat\beta_\mathrm{R}(\lambda) \equiv \mathrm{argmax}_{\beta} \left\{f(Y|\beta) - \lambda||\beta||_2^2 \right\} = \left( X^\top X + \lambda I_p\right)^{-1} X^\top Y \end{aligned} \]
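A minimal sketch of the closed-form ridge solution above, on simulated, centered data (made up for illustration); setting \(\lambda = 0\) recovers the unpenalized least-squares fit:

```r
set.seed(2)
n <- 50; p <- 3
# Center columns of X and y so no intercept is needed
X <- scale(matrix(rnorm(n * p), n, p), center = TRUE, scale = FALSE)
y <- X %*% c(1, -1, 0.5) + rnorm(n)
y <- y - mean(y)

# Closed-form ridge estimator: (X'X + lambda * I)^{-1} X'y
ridge <- function(lambda) {
  solve(crossprod(X) + lambda * diag(p), crossprod(X, y))
}

beta_ols   <- ridge(0)   # lambda = 0: ordinary least squares
beta_ridge <- ridge(10)  # lambda > 0: coefficients shrunk toward zero
sum(beta_ridge^2) < sum(beta_ols^2)  # shrinkage in the l2 norm
```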
\[ \begin{aligned} \mathrm{MSE}\left(\hat\beta_\mathrm{R}(\lambda)\right) &= \mathrm{Variance} + \mathrm{Bias}^2 \\ & = \sigma^2 \sum_{j=1}^p \dfrac{e_j}{(e_j+\lambda)^2} + \lambda^2 \sum_{j=1}^p\dfrac{\alpha_j^2}{(e_j+\lambda)^2}, \end{aligned} \] where \(\alpha = Q \beta\), \(Q\) being the matrix of eigenvectors of \(X^\top X\)
Since \(\frac{e_j}{(e_j+\lambda)^2} < \frac{1}{e_j}\) for \(\lambda > 0\), the ridge estimator (\(\hat\beta_\mathrm{R}(\lambda)\)) will always have lower variance than the MLE (\(\hat\beta_\mathrm{MLE}\))
However, this decrease in variance should not be at the expense of introducing too much bias (the \(\dfrac{\alpha_j^2}{(e_j+\lambda)^2}\) term)
This is called “bias-variance tradeoff”
Hoerl and Kennard (1970) showed that there always exists a \(\lambda > 0\) that decreases the MSE relative to the MLE
Lasso does not have a closed-form solution
The Lasso penalty conducts both variable selection and shrinkage
Ridge will never set estimates exactly to zero, but it does shrink them. It tends to outperform the Lasso when predictors are highly correlated
Have \(n\) observations in data and grid of tuning parameters \(\{\lambda_1 \approx 0, \lambda_2, \ldots, \lambda_M = \infty\}\)
For \(j=1,\ldots,J\), where \(J\) is the number of random repetitions:
2.a Create random partition into \(K\) equally sized parts containing \(n/K\) observations. Each part is a fold
2.b For \(\ell=1,\ldots,K\)
Treat \(\ell\)th fold as validation data, remaining \(K-1\) folds collapsed together as training data. Fit the model for each value of \(\lambda_m\)
Numerically evaluate model using each \(\lambda_m\), i.e. calculate log-likelihood, prediction error, misclassification rate, or other loss function, on \(\ell\)th fold
Average across all values of \(\ell\) and \(j\), choose model (\(\lambda_m\)) with best value
Refit selected model (\(\lambda_m\)) to full data
R package glmnet
\[ \begin{aligned} \mathrm{argmax}_{\beta} \left\{f(Y|\beta) - \lambda\left[(1-\alpha)/2\,||\beta||_2^2 + \alpha ||\beta||_1\right]\right\} \end{aligned} \]
Implements linear, logistic, and Poisson likelihoods and Cox partial likelihoods
Use glmnet when you have your specific tuning parameters in mind
Use cv.glmnet to use cross-validation to select your tuning parameters (you nearly always need to use cv.glmnet)
cv.glmnet auto-selects grid of \(\lambda\) values; uses cross-validation to pick \(\lambda\)
If you also want to select \(\alpha\) via cross-validation, you must do so yourself
Use glmnetUtils for “some quality-of-life functions to streamline the process of fitting elastic net models with the glmnet package”
https://cran.r-project.org/web/packages/glmnet/
https://cran.r-project.org/web/packages/glmnet/vignettes/Coxnet.pdf
https://cran.r-project.org/web/packages/glmnet/vignettes/glmnet.pdf
https://cran.r-project.org/web/packages/glmnetUtils/vignettes/intro.html
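A minimal sketch of selecting \(\alpha\) yourself, assuming the glmnet package is installed and using simulated data for illustration: cv.glmnet tunes \(\lambda\) only, so loop over a grid of \(\alpha\) values with a shared foldid so every \(\alpha\) is judged on the same folds.

```r
library(glmnet)
set.seed(3)
x <- matrix(rnorm(100 * 20), 100, 20)
y <- x[, 1] - x[, 2] + rnorm(100)

# Shared fold assignment makes the comparison across alpha values fair
foldid <- sample(rep(1:5, length.out = nrow(x)))
alphas <- c(0, 0.25, 0.5, 0.75, 1)

cv_fits  <- lapply(alphas, function(a)
  cv.glmnet(x, y, foldid = foldid, alpha = a))
best_cvm <- sapply(cv_fits, function(f) min(f$cvm))  # best CV error per alpha

alphas[which.min(best_cvm)]  # selected alpha; then refit on full data
```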
args(glmnet::cv.glmnet);
## function (x, y, weights = NULL, offset = NULL, lambda = NULL,
## type.measure = c("default", "mse", "deviance", "class", "auc",
## "mae", "C"), nfolds = 10, foldid = NULL, alignment = c("lambda",
## "fraction"), grouped = TRUE, keep = FALSE, parallel = FALSE,
## gamma = c(0, 0.25, 0.5, 0.75, 1), relax = FALSE, trace.it = 0,
## ...)
## NULL
Model training always done via maximum penalized likelihood
You can choose the loss function for model selection (not all options apply to all outcome types)
glmnet example 1
library(glmnet);
set.seed(100);
n = 200;#number obs
p = 1e3;#number covariates
rho = 0.025;#compound symmetric correlation
true_beta = c(0.3, 0.3, numeric(p - 2));
chol_x = chol(matrix(rho, nrow = p, ncol = p) +
diag(1 - rho, nrow = p));
sigma_y = sqrt(0.20);
x = matrix(rnorm(n * p), nrow = n) %*% chol_x;
y = x%*%true_beta + rnorm(n, sd = sigma_y);
1 / (1 + sigma_y^2 / drop(crossprod(chol_x%*%true_beta)))
## [1] 0.48
glmnet example 1: ridge
plot(glmnet(x, y, alpha = 0), xvar = "lambda")
glmnet example 1: ridge
ex1_ridge_cv = cv.glmnet(x, y, nfolds = 5, alpha = 0); plot(ex1_ridge_cv);
glmnet example 1: ridge
ex1_ridge <-
glmnet(x, y, lambda = ex1_ridge_cv$lambda.1se, alpha = 0) %>%
tidy() %>%
filter(term != "(Intercept)") %>%
bind_cols(truth = true_beta,
method = "ridge",
partition = 1)
ggplot(ex1_ridge) +
geom_point(aes(x = truth,
y = estimate)) +
geom_abline(intercept = 0, slope = 1) +
theme(text = element_text(size = 24))
glmnet example 1: lasso
plot(glmnet(x, y, alpha = 1), xvar = "lambda")
glmnet example 1: lasso
ex1_lasso_cv = cv.glmnet(x, y, nfolds = 5, alpha = 1); plot(ex1_lasso_cv);
glmnet example 1: lasso
ex1_lasso <-
glmnet(x, y, lambda = ex1_lasso_cv$lambda.1se, alpha = 1) %>%
tidy(return_zeros = TRUE) %>%
filter(term != "(Intercept)") %>%
bind_cols(truth = true_beta,
method = "lasso",
partition = 1)
ggplot(ex1_lasso) +
geom_point(aes(x = truth,
y = estimate)) +
geom_abline(intercept = 0, slope = 1) +
theme(text = element_text(size = 24))
glmnet example 1: elastic net
lasso_fraction = 0.5; plot(glmnet(x, y, alpha = lasso_fraction), xvar = "lambda")
glmnet example 1: elastic net
ex1_enet_cv = cv.glmnet(x, y, nfolds = 5, alpha = lasso_fraction); plot(ex1_enet_cv);
glmnet example 1: elastic net
ex1_enet <-
glmnet(x, y, lambda = ex1_enet_cv$lambda.1se, alpha = lasso_fraction) %>%
tidy(return_zeros = TRUE) %>%
filter(term != "(Intercept)") %>%
bind_cols(truth = true_beta,
method = "enet",
partition = 1)
ggplot(ex1_enet) +
geom_point(aes(x = truth,
y = estimate)) +
geom_abline(intercept = 0, slope = 1) +
theme(text = element_text(size = 24))
| method | partition | RMSE (×10,000) |
|---|---|---|
| ridge | 1 | 133.9 |
| ridge | 2 | 133.9 |
| lasso | 1 | 68.0 |
| lasso | 2 | 68.0 |
| enet | 1 | 58.0 |
| enet | 2 | 65.1 |
glmnet example 2
set.seed(100);
n = 200;#number obs
p = 1e3;#number covariates
rho = 0.4;#compound symmetric correlation
true_beta = 0.6/p + numeric(p);
chol_x = chol(matrix(rho, nrow = p, ncol = p) +
diag(1 - rho, nrow = p));
sigma_y = sqrt(0.20);
x = matrix(rnorm(n * p), nrow = n) %*% chol_x;
y = x%*%true_beta + rnorm(n, sd = sigma_y);
1 / (1 + sigma_y^2 / drop(crossprod(chol_x%*%true_beta)))
## [1] 0.419
glmnet example 2: ridge / lasso / elastic net
[Figures: coefficient paths and cross-validation curves for ridge, lasso, and elastic net, analogous to example 1]

| method | partition | RMSE (×10,000) |
|---|---|---|
| ridge | 1 | 2.64 |
| ridge | 2 | 3.06 |
| lasso | 1 | 28.79 |
| lasso | 2 | 27.98 |
| enet | 1 | 27.89 |
| enet | 2 | 27.89 |
Can also use Lasso as alternative to stepwise procedures:
| | Full model | Most common (bootstrap) | Forward selected | Smallest AIC (bootstrap) | Ridge | Enet | Lasso |
|---|---|---|---|---|---|---|---|
| area_worst | 10.69 | 11.10 | 11.57 | 8.22 | 0.63 | 1.05 | |
| compactness_worst | -1.31 | -1.47 | 0.17 | ||||
| concavepoints_worst | 2.47 | 2.90 | 2.53 | 2.42 | 0.59 | 0.97 | 1.31 |
| concavity_worst | 0.96 | 1.06 | 0.33 | 0.31 | 0.19 | ||
| fractal_dimension_worst | -0.14 | -0.03 | |||||
| perimeter_worst | 0.44 | -2.92 | -3.19 | 0.68 | 1.11 | ||
| radius_worst | -2.59 | 0.71 | 1.28 | 3.55 | |||
| smoothness_worst | 1.23 | 1.06 | 1.05 | 1.20 | 0.31 | 0.47 | 0.44 |
| symmetry_worst | 0.60 | 0.43 | 0.58 | 0.23 | 0.24 | 0.18 | |
| texture_worst | 1.74 | 1.73 | 1.72 | 1.75 | 0.49 | 0.84 | 0.93 |
Could we have used AIC here? For linear ridge regression, we have
\[ \begin{aligned} p_\mathrm{eff}(\lambda) = \mathrm{Trace}\left[ X \left( X^\top X + n\lambda I_p\right)^{-1} X^\top\right] < \mathrm{Trace}\left[ X \left( X^\top X\right)^{-1} X^\top\right] = p \end{aligned} \] So, sometimes yes but not generally possible for non-linear estimators or for non-ridge penalties
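A minimal sketch of the effective-parameter count above for linear ridge regression, on a simulated design matrix (made up for illustration): \(p_\mathrm{eff}(\lambda)\) equals \(p\) at \(\lambda = 0\) and shrinks toward zero as \(\lambda\) grows.

```r
set.seed(4)
n <- 40; p <- 5
X <- matrix(rnorm(n * p), n, p)

# Effective number of parameters: trace of the ridge hat matrix
p_eff <- function(lambda) {
  H <- X %*% solve(crossprod(X) + n * lambda * diag(p)) %*% t(X)
  sum(diag(H))
}

p_eff(0)    # equals p = 5 (unpenalized fit)
p_eff(0.5)  # strictly less than p
```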
Subset selection can be written as penalized likelihood problem:
\[ \begin{aligned} \hat\beta_\mathrm{BSS}(\lambda) \equiv \mathrm{argmax}_{\beta} \left\{f(Y|\beta) - \lambda||\beta||_0 \right\}, \\ ||\beta||_0 = \sum_{j=1}^p 1_{[\beta_j\neq 0]} \end{aligned} \]
Cost of \(\lambda\) for each non-zero coefficient estimate, but no additional penalty for the magnitude of non-zero coefficient estimates.
Focus here was on model selection / variable selection
Less focus on inference, which is difficult to do after model selection
Re-sampling ideas for inference also exist
Berk, Richard, Lawrence Brown, and Linda Zhao. 2010. “Statistical Inference After Model Selection.” Journal of Quantitative Criminology 26 (2): 217–36.
Good, I. J., and R. A. Gaskins. 1971. “Nonparametric Roughness Penalties for Probability Densities.” Biometrika 58 (2): 255–77.
Hastie, T, R Tibshirani, and J Friedman. 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd ed. New York, NY: Springer.
Hoerl, Arthur E., and Robert W. Kennard. 1970. “Ridge Regression: Biased Estimation for Nonorthogonal Problems.” Technometrics 12: 55–67.
Tibshirani, Robert. 1996. “Regression Shrinkage and Selection via the Lasso.” Journal of the Royal Statistical Society: Series B 58: 267–88.
Zou, Hui, and Trevor Hastie. 2005. “Regularization and Variable Selection via the Elastic Net.” Journal of the Royal Statistical Society, Series B 67: 301–20.